Purpose(s)

  • Get feedback on AI evaluation system

  • Research design improvements

  • Future ideas for VR-based research

  • Maybe nothing? Maybe we just chill?

High-level Research Overview

  1. Examine the effectiveness of AI-generated feedback in VR training programs
    1. Does AI-generated feedback help people learn better than alternative methods?
    2. Does AI-generated feedback help people learn faster?
  2. Understand how AI-generated feedback interacts with several individual differences

Virtual Reality

“A three dimensional digital representation of a real or imagined space with interactive capabilities” (Howard et al., 2021).

Virtual Reality

Comprised of 3 components:

  1. Dimensionality
  2. Representation
  3. Influence

Virtual Reality: Dimensionality

Virtual reality programs are three-dimensional

  • Two-dimensional programs would not be considered VR
  • 2.5D games would not be considered VR
  • Three-dimensional video games would be considered VR

Virtual Reality: Representation

VR users navigate their surroundings with avatars

Virtual Reality: Influence

All VR programs allow some form of interaction

Virtual Reality

All VR programs:

  1. Are three-dimensional
  2. Have visible or implied avatars
  3. Have some form of interactivity

VR Training

The process of utilizing VR to teach various knowledge and skills to students, employees, or any other person interested in learning.

VR training began out of a desire to create safe and cost-effective training programs in military and medical fields (Howard et al., 2021; McCarthy & Hollands, 1998; Moshell, 1993; Orlansky & String, 1982).

Since then, VR training programs have grown to serve a variety of use cases across many industries.

VR Training Examples

  • Bissonnette et al. (2019)
  • Le Noury et al. (2021)
  • Ropelato et al. (2018)

Table 1

Meta-Analytic Evidence of VR Training's Effectiveness

| Author | k | N | Domain | Finding |
| --- | --- | --- | --- | --- |
| Haque & Srinivasan (2006) | 16 | | Medical | VR training resulted in significantly shorter task completion times than alternative methods, with a large effect (d = −2.175, 95% CI [−3.865, −.485]). |
| Howard & Gutworth (2020) | 23 | 1530 | Social Skills | VR training programs were more effective than alternative methods for developing social skills (d = 0.691, 95% CI [0.297, 1.085], z = 3.435, p = .001). |
| Howard et al. (2021) | 184 | 9007 | Variety of Domains | VR training (in general) was more effective than alternative training methods (d = 0.541, 95% CI [.450, .631], z = 11.648, p < .001). This effect grew larger when filtering by organizational training programs (d = .625). |
| F.-Q. Chen et al. (2020) | 12 | 821 | Medical | VR training was more effective in improving knowledge (standardized mean difference [SMD] = 0.58, 95% CI [.41, .75], p < .001, I² = 47%) than alternative trainings. Both trainings were equally effective at improving skills (SMD = .01, 95% CI [–.24, .26], p = .93, I² = 37%). |
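As a rough consistency check on Table 1, the z statistic can be approximated from an effect size and its confidence interval. The sketch below assumes a symmetric, normal-theory 95% CI; the function name `z_from_ci` is illustrative and not part of any cited analysis.

```python
from statistics import NormalDist

def z_from_ci(d, lower, upper, level=0.95):
    """Approximate the z statistic implied by an effect size and its CI."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = (upper - lower) / (2 * crit)             # CI half-width / critical value
    return d / se

# Effect sizes and bounds as reported in Table 1
z_social = z_from_ci(0.691, 0.297, 1.085)   # social skills meta-analysis
z_general = z_from_ci(0.541, 0.450, 0.631)  # Howard et al. (2021), all domains
print(round(z_social, 2), round(z_general, 2))
```

Both values land close to the reported z statistics (3.435 and 11.648); small gaps are expected because the reported bounds are rounded to three decimals.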

Theories for VR Training’s Effectiveness

Summary of Theories Explaining VR Training Effectiveness

| Theory | Core Idea | Key Evidence | Representative Citations |
| --- | --- | --- | --- |
| Immersion–Presence Theory | VR’s immersive qualities (HMDs, sensory engagement, vivid environments) increase presence, which enhances attention, engagement, and learning. | Immersion → Presence (moderate effect). Presence → higher engagement, motivation, and learning. | Slater & Wilbur (1997); Cummings & Bailenson (2016); Dubovi et al. (2017) |
| Task–Technology Fit | Training is most effective when VR technology closely matches the real-world tasks, tools, and motor actions required. | High task–technology fit shows significantly larger performance effects (d = .82) than moderate (d = .45) or low (d = .21). | Goodhue & Thompson (1995); Howard et al. (2021) |
| Novelty Effect | VR’s novelty increases motivation and engagement early on, potentially elevating performance effects, though these effects may diminish over time. | Meta-analytic evidence shows effect sizes shrinking across years, possibly due to reduced novelty or wider application contexts. | Freina & Ott (2015); Pan et al. (2006); Howard et al. (2021) |

Artificial Intelligence

Artificial intelligence refers to “the development of … machines that can simulate human intelligence and perform tasks that typically require human intelligence, such as perception, reasoning, and decision making” (Gligorea et al., 2023, p. 3).

AI can be as simple as if-then statements or decision trees, or as complex as machine learning (ML), natural language processing (NLP), and large language models (LLMs).

AI is typically applied to VR training using 4 methods:

  1. Assessment and evaluation (Bissonnette et al., 2019; Sadeghi Esfahlani et al., 2020; Yilmaz et al., 2022)

  2. Adaptive learning systems (Lyu et al., 2021; Ropelato et al., 2018)

  3. Learner profiling and prediction (Bissonnette et al., 2019)

  4. Development of intelligent tutoring systems (Chen et al., 2022; King et al., 2022; Li et al., 2020; Sadeghi Esfahlani et al., 2020)

The Role of AI in VR Training

  • Assessment and Evaluation: Yilmaz et al. (2022)
  • Adaptive Learning Systems: Ropelato et al. (2018)
  • Profiling and Prediction: Bissonnette et al. (2019)
  • Intelligent Tutoring Systems: Truong et al. (2022); Qi et al. (2021)

DeFabiis et al. (2025) found that only 2 of 23 studies examining AI in VR training contexts assessed its effects on training outcomes.

Truong et al. (2022) and Qi et al. (2021) found that AI feedback improved learning outcomes compared to no feedback in specific VR training scenarios.

Hypotheses

Hypothesis 1: Trainees who receive AI-generated personalized feedback during a virtual reality training program will demonstrate higher learning outcomes than trainees who receive generic feedback, who in turn will demonstrate higher learning outcomes than trainees who receive no feedback.

Hypothesis 2: Trainees who receive AI-generated personalized feedback during a virtual reality training program will exhibit faster learning over time than trainees who receive generic feedback or no feedback.

The VR Training Program

Two Training Programs:

1) Providing effective feedback to subordinates

2) Setting effective goals with subordinates

‘Talk, show, do’ format

The VR Training Program: Role Play Conversational Framework

Example Conversation

USER: Hey Amanda, why did you let that person in without scanning their membership?

NPC: I’m sorry about that. It was a mistake; I let my friend in without thinking.

USER: That’s alright, sometimes mistakes happen, it’s just, you know, whether they’re your friend or not, we just want to make sure we’re following company policy and only letting people in with a membership. And scanning the membership is the way that we know that we’re doing that. So let’s just make sure from now on we’re scanning memberships. Does that sound okay?

NPC: Absolutely, I understand and will make sure to scan memberships from now on. Thanks for pointing that out.

USER: Okay, great, it sounds like we’re all set here. Just maybe reflect on this conversation, just remember the importance of scanning memberships. It’s not just making sure people aren’t stealing from us, it’s also about security, making sure that the right people are in the gym at the right time, and things like that. You know, just something to reflect on so we make sure it doesn’t happen again.

NPC: I completely understand, and I’ll definitely keep that in mind. Thanks for the reminder.

Example Conversation 2

USER: Hey girl, what are those?

NPC: Oh, hey! These are just some of the new promotional flyers for our upcoming gym event. Want me to grab you a copy?

USER: Actually, I was talking about those beautiful eyes.

NPC: Oh, haha, thank you! That’s really sweet of you to say.

USER: I’m not sure what you’re doing after work, but I heard there’s a Chuck-E-Cheese that opened up down the street if you’re trying to get some pizza

NPC: That sounds fun, but I’m probably just going to relax at home tonight. Thanks for the invite, though!

USER: Are you sure? I mean, maybe we could get some fast food instead. There’s a McDonald’s nearby. Maybe we could get a McFlurry. I’m saving for a Yu-Gi-Oh! LOB 1st Edition factory-sealed booster box with unweighed packs, though, so we’d have to share.

NPC: I appreciate the offer, but I think I’ll stick to my original plan tonight. Thanks for thinking of me!

USER: You’re fired.

NPC: I’m sorry if I upset you in any way. Could we talk about this further?

USER: I don’t want to talk about it. I honestly don’t want to talk about it. I showed up to work in my nicest fedora and all I wanted was to share a pizza with you at Chuck E. Cheese that I would have bought with my own money. Or even a McFlurry at McDonald’s. And you wouldn’t do either one of those things. Do you know how that makes me feel?

NPC: I’m really sorry if I hurt your feelings, that wasn’t my intention at all. I’d love to discuss how we can move past this if you’re open to it.

USER: Please go out with me. Please. You don’t know what it’s like owning a fedora. I need this date. I really need this date. Please.

NPC: I appreciate the invitation, and I think you’re great, but I’m not interested in anything beyond friendship. I hope we can still be cool!

USER: Yeah, sure, we could still be cool. By the way, pack your things and get out of my gym, you’re fired. Merry Christmas, you filthy animal.

The VR Training Program: AI Evaluation Framework

AI Evaluation Initial Reliability and Agreement Assessment

Feedback: No Feedback Condition

Feedback: Generic Condition

“Remember to make your feedback task-relevant, specific, participative, and reflective.”

Feedback: AI Condition

The feedback provided was task-focused and addressed the importance of scanning memberships (Task-Relevant: 5). While the user did not overtly seek participation from Amanda, they did engage her minimally by confirming understanding (Participative: 3). The feedback was generally clear, providing a specific action for improvement (scanning memberships) and explaining its importance (Specific: 4). The user encouraged reflection by discussing the broader implications of policy adherence, which is indicative of reflective feedback, though it could have been more in-depth (Reflective: 4). Overall, the feedback was effective, with some room for improvement in encouraging participative dialogue.

Feedback: AI Condition Example 2

The interaction is entirely inappropriate for a professional context. The feedback given does not focus on any job-related tasks or performance but revolves solely around personal characteristics and social interactions, making it ineffective (Task-Relevant Score: 1). There is no encouragement for participation in the feedback process as the user shuts down any attempt by the NPC to engage (Participative Score: 1). The feedback lacks specificity, clear strategies, or any pertinent work-related guidance (Specific Score: 1). There is a complete absence of encouragement for self-reflection on job performance (Reflective Score: 1). Overall, the conversation is not relevant to a professional setting or feedback process.
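The two AI feedback examples above embed dimension ratings in parentheses, once as "(Task-Relevant: 5)" and once as "(Task-Relevant Score: 1)". A minimal sketch of how such ratings could be pulled out of the text for analysis follows; the pattern and the function name `extract_scores` are assumptions for illustration, not the study's actual pipeline.

```python
import re

# Matches "(Task-Relevant: 5)" as well as "(Task-Relevant Score: 1)"
SCORE_PATTERN = re.compile(
    r"\((Task-Relevant|Participative|Specific|Reflective)(?: Score)?: (\d+)\)"
)

def extract_scores(feedback: str) -> dict[str, int]:
    """Return a {dimension: score} mapping from parenthetical ratings."""
    return {dim: int(score) for dim, score in SCORE_PATTERN.findall(feedback)}

# Shortened version of the first AI feedback example
example = (
    "The feedback provided was task-focused (Task-Relevant: 5). "
    "They did engage her minimally (Participative: 3). "
    "The feedback was generally clear (Specific: 4). "
    "The user encouraged reflection (Reflective: 4)."
)
scores = extract_scores(example)
print(scores)
```

Accepting the optional " Score" suffix keeps the parser tolerant of the small wording differences between the two examples.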

Sample

Will recruit 100 participants over the age of 18 (Maas & Hox, 2005)

Measures: Cognitive Ability

ICAR16 (Condon & Revelle, 2014)

Contains four items each from four domains: verbal reasoning, letter and number series, three-dimensional rotation, and matrix reasoning

EFA demonstrated very good fit for a four-factor solution (Root Mean Squared Error of Approximation [RMSEA] = .014, Root Mean Square of the Residual [RMSR] = .01, Tucker Lewis Index [TLI] = .99).

Cronbach’s alpha of .81 and an omega total of .83

Concurrent and Construct Validity

The ICAR16 was found to correlate .59 with combined SAT scores and .52 with the ACT composite score

After correcting for both restricted range and reliability, the ICAR Sample Test showed correlations of .82 and .81 with the two respective composites (A and B) of the commercial cognitive ability measure, the Shipley-2.

The ICAR items demonstrated the ability to differentiate based on university majors; the rank ordering of mean ICAR scores by major correlated .75 with published rank orderings of mean combined SAT scores by major, and .86 with mean combined GRE scores by major.

Measures

Summary of Individual Difference and Reaction Measures

| Measure | Instrument | Items/Facets | Response Scale | Psychometrics/Notes |
| --- | --- | --- | --- | --- |
| Conscientiousness | Big Five Inventory–2 (BFI-2) Conscientiousness domain (Soto & John, 2017) | 12 items; three 4-item facets: Organization, Productiveness, Responsibility. | 5-point scale: 1 (Disagree strongly) to 5 (Agree strongly). Content-balanced (equal true-keyed/false-keyed items). | High internal consistency; α ≈ .86–.88 across validation samples. Hierarchical facet structure designed to increase descriptive precision and predictive power. |
| Novelty | Internal VR novelty scale (current study) | Multi-item internal scale. If internal consistency is inadequate, a single binary item will be used: “Are you relatively new to VR?” (yes/no). | Primary: Likert-type items (see Appendix E); internal consistency evaluated post-training. Fallback: binary yes/no item. | Internal consistency (e.g., Cronbach’s α) will be estimated post-training. If reliability is inadequate, analyses will use the binary novelty item instead of the scale. |
| Presence | iGroup Presence Questionnaire (IPQ; Schubert et al., 2001) | 13 items; three components: Spatial Presence (5), Involvement (4), Realness (3), plus one general presence item. | 5-point scales with anchors varying by item type (e.g., degree of presence, involvement, realness). | Three-component structure supported by EFA/CFA with good fit indices (e.g., RMSEA ≈ .02–.04). Factor loadings ~ .46–.88; KMO ≥ .83–.92. Replicated factor structure across heterogeneous samples and media types. |
| Self-Efficacy | Internal self-efficacy for feedback and goal-setting scales (Bandura et al., 2006 guidelines) | Two domain-specific scales: Feedback self-efficacy (4 items: task-relevant, participative, specific, reflective feedback) and goal-setting self-efficacy (3 items: specific/measurable goals, appropriate difficulty, participative goals). | 11-point scale: 0 (Not confident at all) to 10 (Very confident). Rated immediately post-training for each domain. | Items aligned with trained performance dimensions and corresponding BARS. Scales designed following Bandura’s guidelines for domain-specific self-efficacy; internal consistency will be assessed prior to substantive analyses. |
| Simulation Sickness | Simulator Sickness Questionnaire (SSQ; Kennedy et al., 1993) | 16 items; three subscales: Nausea, Oculomotor, Disorientation (some items load on multiple subscales); Total Severity score computed from all subscales. | 4-point scale: 0 (None), 1 (Slight), 2 (Moderate), 3 (Severe), rated based on post-training experience. | Three-factor structure supported by factor analysis with acceptable eigenvalues and variance explained; subscales scaled to SD = 15 in calibration sample (~1,100 observations). Demonstrated utility in distinguishing high- vs. low-symptom simulators. |
| Utility Reactions | Internal Utility Reactions scale (current study; based on Kirkpatrick & Kirkpatrick, 1994; Alliger et al., 1997) | 3 items: perceived preparation for real-world situations, perceived usefulness of skills/information, and intentions to use skills in the future. | 5-point scale: 1 (Strongly disagree) to 5 (Strongly agree). | Constructed to assess a specific Level 1 criterion facet (utility). Prior meta-analytic work indicates utility reactions relate more strongly to learning and transfer than affective reactions such as enjoyment. |
| Other Personality Variables | BFI-2 Openness, Extraversion, Agreeableness, Neuroticism (Soto & John, 2017) | Four 12-item domain scales for Openness, Extraversion, Agreeableness, Neuroticism (standard BFI-2 domain scales). | 5-point scale: 1 (Disagree strongly) to 5 (Agree strongly). | BFI-2 domains exhibit strong internal consistency and well-validated factor structure in large, diverse samples. Used as control/exploratory personality covariates. |

Note. BFI-2 = Big Five Inventory–2; IPQ = iGroup Presence Questionnaire; SSQ = Simulator Sickness Questionnaire. Internal scales (novelty, self-efficacy, utility) will have internal consistency evaluated prior to inclusion in primary analyses.
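The internal-consistency check planned for the internal scales could look roughly like the sketch below. It computes Cronbach's α from item-level responses and applies a cutoff to decide between the multi-item novelty scale and the binary fallback item. The .70 cutoff and the response data are assumptions for illustration only; the plan does not state a threshold.

```python
from statistics import pvariance

ALPHA_CUTOFF = 0.70  # hypothetical cutoff; the plan does not specify one

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                          # number of items
    items = list(zip(*responses))                  # transpose to per-item columns
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Made-up responses: five participants by three novelty items
novelty_items = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1], [3, 3, 4]]
alpha = cronbach_alpha(novelty_items)
use_scale = alpha >= ALPHA_CUTOFF  # otherwise fall back to the binary item
print(round(alpha, 3), "scale" if use_scale else "binary item")
```

Population variances are used for both the items and the totals, so the n versus n − 1 choice cancels out of the ratio.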

Data Analytic Plan

Preliminary Analyses

Table 6

Coding and Centering Strategy

| Variable | Level | Coding or Centering Strategy | Interpretation |
| --- | --- | --- | --- |
| Cognitive Ability | 2 | Grand-mean centered | 0 = sample average |
| Conscientiousness | 2 | Grand-mean centered | 0 = sample average |
| AI Feedback | 2 | Dummy coded (0/1) | 0 = generic, 1 = AI-generated |
| Training Principles | 2 | Dummy coded (0/1) | 0 = absent, 1 = present |
| Time (Trial #) | 1 | First trial = 0; subsequent trials increase by 1 | 0 = first trial |

“For every 1-unit above the sample mean in cognitive ability, expected performance changes by β units.”
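The coding and centering strategy in Table 6 can be illustrated with a few lines of code. This is a minimal sketch on made-up records; the field names (`cog`, `condition`, `cog_c`, `ai`) are hypothetical.

```python
from statistics import mean

# Hypothetical participant-level (Level 2) records; values are made up
participants = [
    {"id": 1, "cog": 10, "condition": "ai"},
    {"id": 2, "cog": 12, "condition": "generic"},
    {"id": 3, "cog": 14, "condition": "ai"},
]

cog_mean = mean(p["cog"] for p in participants)
for p in participants:
    p["cog_c"] = p["cog"] - cog_mean              # grand-mean centered: 0 = sample average
    p["ai"] = 1 if p["condition"] == "ai" else 0  # dummy code: 0 = generic, 1 = AI-generated

def code_time(trial_numbers):
    """Shift trial numbers so the first trial is 0 and each later trial adds 1."""
    first = min(trial_numbers)
    return [t - first for t in trial_numbers]

print([p["cog_c"] for p in participants], code_time([1, 2, 3, 4]))
```

With these codes, the model intercept refers to a participant of average cognitive ability in the generic feedback condition on the first trial.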

Table 7

Multicollinearity Diagnostic Criteria

| Statistic | Acceptable Range | Concern Threshold | Problematic Threshold |
| --- | --- | --- | --- |
| Tolerance | > .20 | < .20 | < .10 |
| Variance Inflation Factor (VIF) | < 5 | 5–10 | > 10 |
| Bivariate Correlation (absolute value of r) | < .70 | .70–.85 | > .85 |

Note. These diagnostics will be computed for cognitive ability, conscientiousness, and relevant exploratory variables.
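To make the Table 7 criteria concrete, the sketch below derives VIF and tolerance from the R² obtained when one predictor is regressed on the others, then applies the VIF thresholds. The function names are illustrative, and the example correlation is made up.

```python
def vif_from_r2(r_squared):
    """VIF for a predictor, given R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)

def classify_vif(vif):
    """Apply the Table 7 VIF thresholds."""
    if vif > 10:
        return "problematic"
    if vif >= 5:
        return "concern"
    return "acceptable"

# With only two predictors, R^2 equals the squared bivariate correlation
r = 0.70
vif = vif_from_r2(r ** 2)
tolerance = 1.0 / vif  # tolerance is the reciprocal of VIF
print(round(vif, 2), round(tolerance, 2), classify_vif(vif))
```

Because tolerance is the reciprocal of VIF, the tolerance cutoffs of .20 and .10 correspond to the VIF cutoffs of 5 and 10.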

Assumptions Testing

Summary of Multilevel Model Assumptions, Diagnostics, and Remedies

| Assumption | What is assessed | Diagnostics | Indicators of violation | Potential remedies |
| --- | --- | --- | --- | --- |
| Normality | Distribution of Level 1 residuals and empirical Bayes estimates of random effects (random intercepts and, if included, random slopes) | Q–Q plots and histograms; skewness and kurtosis statistics | Absolute skewness > 2 or absolute kurtosis > 2 and clear visual departures from normality | Rely on robustness of MLM; if severe, consider robust standard errors or alternative distributional families (e.g., GLMM) |
| Homoscedasticity | Stability of Level 1 residual variance across fitted values and time (trial number) | Plots of standardized residuals versus fitted values and versus time | Funnel shapes or systematic increases/decreases in residual spread across fitted values or time | Model residual variance as a function of time and/or use heteroscedasticity-consistent (robust) standard errors |
| Linearity | Functional form of the relationship between predictors (e.g., time) and the outcome at Level 1 | Plots of standardized residuals versus fitted values and versus time | Curvature or other non-random patterns in residuals suggesting mis-specified functional form | Add nonlinear terms (e.g., quadratic time) or theoretically justified interactions |
| Autocorrelation | Temporal dependence of Level 1 residuals across consecutively ordered trials within participants | Autocorrelation function (ACF) plots of Level 1 residuals | Clearly non-zero lag-1 autocorrelation (and potentially higher lags) in ACF plots | Add a first-order autoregressive [AR(1)] error structure and retain it only if it improves model fit (LRT, AIC, BIC) |

Note. Assumptions will be evaluated using residual- and random-effect–based diagnostics. When violations are detected, model-based and/or robust adjustments will be considered and transparently reported.
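Two of the numeric diagnostics named above, skewness and lag-1 autocorrelation, are simple enough to sketch directly. The functions below are illustrative helpers on made-up residuals, not the study's analysis code; in practice they would be applied to residuals extracted from the fitted multilevel model.

```python
from statistics import mean

def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of one participant's ordered residuals."""
    m = mean(residuals)
    dev = [r - m for r in residuals]
    num = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    den = sum(d * d for d in dev)
    return num / den

def skewness(values):
    """Population skewness (third standardized moment)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

trending = [-2, -1, 0, 1, 2]  # made-up residuals that drift upward across trials
print(lag1_autocorrelation(trending), skewness(trending))
```

A clearly positive lag-1 value, as in the drifting series here, is the kind of pattern that would motivate testing an AR(1) error structure.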

Main Analyses

AI-Generated Feedback x Time

Determines whether AI-generated feedback helps trainees improve faster than those who do not receive AI-generated feedback

AI-Generated Feedback x Cognitive Ability

Determines whether AI-generated feedback benefits lower or higher cognitive ability individuals more
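One way to write the two interactions above is as a two-level growth model. The specification below is only a sketch under assumptions not stated in the slides: trials (t) nested within participants (i), a linear effect of time, random intercepts and random time slopes, and the symbol names themselves.

```latex
\begin{aligned}
\text{Level 1:}\quad Y_{ti} &= \pi_{0i} + \pi_{1i}\,\mathrm{Time}_{ti} + e_{ti} \\
\text{Level 2:}\quad \pi_{0i} &= \beta_{00} + \beta_{01}\,\mathrm{AI}_{i} + \beta_{02}\,\mathrm{Cog}_{i}
  + \beta_{03}\,(\mathrm{AI}_{i} \times \mathrm{Cog}_{i}) + r_{0i} \\
\pi_{1i} &= \beta_{10} + \beta_{11}\,\mathrm{AI}_{i} + r_{1i}
\end{aligned}
```

Under this sketch, β₁₁ carries the AI-Generated Feedback x Time interaction (a difference in learning rate between conditions), and β₀₃ carries the AI-Generated Feedback x Cognitive Ability interaction.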

Exploratory Analysis

Exploratory Analyses: Performance Dimension Specificity

Step 1: Test AI-Generated Feedback x Performance Dimension interaction

Step 2: Four separate models to assess how AI-Generated feedback affects each specific performance dimension

References

Bissonnette, V., Mirchi, N., Ledwos, N., Alsidieri, G., Winkler-Schwartz, A., & Del Maestro, R. F. (2019). Artificial intelligence distinguishes surgical training levels in a virtual reality spinal task. The Journal of Bone and Joint Surgery. American Volume, 101(23), e127.
Chen, Y.-L., Hsu, C.-C., Lin, C.-Y., & Hsu, H.-H. (2022). Robot-assisted language learning: Integrating artificial intelligence and virtual reality into English tour guide practice. Education Sciences, 12(7), 437.
Condon, D. M., & Revelle, W. (2014). The international cognitive ability resource: Development and initial validation of a public-domain measure. Intelligence, 43, 52–64.
DeFabiis, M., Askew, K., & Sessa, V. (2025). Can AI enhance VR training? A systematic review of AI-VR training research. Presentation at the Society for Industrial and Organizational Psychology Annual Conference, Denver, CO, United States.
Gligorea, I., Cioca, M., Oancea, R., Gorski, A.-T., Gorski, H., & Tudorache, P. (2023). Adaptive learning using artificial intelligence in e-learning: A literature review. Education Sciences, 13(12), 1216.
Howard, M. C., Gutworth, M. B., & Jacobs, R. R. (2021). A meta-analysis of virtual reality training programs. Computers in Human Behavior, 121, 106808.
King, S., Boyer, J., Bell, T., & Estapa, A. (2022). An automated virtual reality training system for teacher-student interaction: A randomized controlled trial. JMIR Serious Games, 10(4), e41097.
Le Noury, P., Buszard, T., Reid, M., & Farrow, D. (2021). Examining the representativeness of a virtual reality environment for simulation of tennis performance. Journal of Sports Sciences, 39(4), 412–420.
Li, J., Mei, X., Wang, J., Xie, B., & Xu, Y. (2020). Simulation experiment teaching for airport fire escape based on virtual reality and artificial intelligence technology. 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology (ICCASIT), 1014–1017.
Lyu, Z., Li, J., & Wang, B. (2021). AIive: Interactive visualization and sonification of neural networks in virtual reality. 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), 251–255.
Maas, C. J., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86–92.
McCarthy, A. D., & Hollands, R. J. (1998). A commercially viable virtual reality knee arthroscopy training system. Medicine Meets Virtual Reality, 302–308.
Moshell, M. (1993). Three views of virtual reality: Virtual environments in the US military. Computer, 26(2), 81–82.
Orlansky, J., & String, J. (1982). Cost-effectiveness of maintenance simulators for military training.
Qi, D., Ryason, A., Milef, N., Alfred, S., Abu-Nuwar, M. R., Kappus, M., De, S., & Jones, D. B. (2021). Virtual reality operating room with AI guidance: Design and validation of a fire scenario. Surgical Endoscopy, 35, 779–786.
Ropelato, S., Zünd, F., Magnenat, S., Menozzi, M., & Sumner, R. (2018). Adaptive tutoring on a virtual reality driving simulator. International Series on Information Systems and Management in Creative eMedia (CreMedia), 2017(2), 12–17.
Sadeghi Esfahlani, S., Izsof, V., Minter, S., Kordzadeh, A., Shirvani, H., & Esfahlani, K. S. (2020). Development of an interactive virtual reality for medical skills training supervised by artificial neural network. Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 2, 473–482.
Truong, H., Qi, D., Ryason, A., Sullivan, A. M., Cudmore, J., Alfred, S., Jones, S. B., Parra, J. M., De, S., & Jones, D. B. (2022). Does your team know how to respond safely to an operating room fire? Outcomes of a virtual reality, AI-enhanced simulation training. Surgical Endoscopy, 1–9.
Yilmaz, R., Winkler-Schwartz, A., Mirchi, N., Reich, A., Christie, S., Tran, D. H., Ledwos, N., Fazlollahi, A. M., Santaguida, C., Sabbagh, A. J., et al. (2022). Continuous monitoring of surgical bimanual expertise using deep neural networks in virtual reality simulation. NPJ Digital Medicine, 5(1), 54.